Abstract

Training a denoising autoencoder neural network requires access to 
truly clean data, a requirement which is often impractical. To remedy 
this, we introduce a method to train an autoencoder using only noisy 
data, comprising examples with and without the signal class of interest. The
autoencoder learns a partitioned representation of signal and noise,
learning to reconstruct each separately. We illustrate the method by denoising
birdsong audio (available abundantly in uncontrolled noisy datasets) using 
a convolutional autoencoder. 


1 Introduction 

An autoencoder (AE) is a neural network trained in unsupervised fashion, to 
encode its input to some latent representation and to decode that representation 
to a faithful reconstruction of its input. The autoencoder can then be used as a 
codec, or to convert data to its latent representation for downstream processing 
such as classification. The denoising autoencoder (DAE) is a variant of this in 
which the inputs are combined with some corruption (such as additive noise 
or masking), and the system is trained to recover the clean, de-noised data 
[1]. The DAE training scheme can be used in denoising applications, and is 
also a popular way to encourage the autoencoder to learn a more meaningful 
latent representation of the data. Autoencoders including the DAE have yielded 
leading results in recent years in deep learning for signal processing [1, 2]. 

However, there is a significant problem with the DAE approach which hampers
its use in practical applications: it may often be impossible to supply
truly clean data. This is common in our application example (natural sound
recordings) but also for video, image and audio applications across many
domains. In fact, objects/events are often sparsely represented in data while
background and other noise are densely represented, meaning that it is often easy to
provide “noise-only” examples while difficult to provide “noise-free” examples.

In this paper we propose an alternative approach to train an AE so that it 
can perform denoising, given training data which can only be weakly labelled 
as “noise-only” or “noise and possible signal”. The system learns a partitioned 
latent representation, with identifiable noise and signal coefficients, and can 
then perform denoising and/or recover a signal-only latent representation for 
further analysis. The method is general-purpose; we illustrate it here with an 
application to denoising birdsong audio spectrograms. 


2 Partitioned autoencoders 

A standard AE learns a function of the form $\hat{X} = g(f(X))$ where $X$ is an input
datum (a matrix, in this paper), and the autoencoder is composed of encoder
$f(\cdot)$ and decoder $g(\cdot)$. The DAE learns a function $\hat{X} = g(f(u(X)))$ where $u(\cdot)$
is a stochastic noise corruption process. The training objective is such that
$\hat{X}$ is encouraged to be as close to $X$ as possible, often $\sum_i \|X_i - \hat{X}_i\|^2$ where
$\|\cdot\|$ is the Frobenius norm and the sum is taken over a minibatch of training
data. (A minibatch is a small subset of training data used for one iteration of
stochastic gradient descent [SGD].) Given this objective, in practice a DAE will
learn to denoise its input to the extent that $X$ itself is clean, having no incentive
to “overshoot” and remove any noise that may be intrinsic to the original $X$.
The latent representation output from $f(\cdot)$ parametrises the manifold on which
the reconstructed signal data lie. Information about the noise present in each
datum is not captured, except implicitly as $X - \hat{X}$ if $\hat{X}$ is a good estimate.
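
The two reconstruction schemes and the minibatch objective can be sketched in a few lines of NumPy. This is only an illustration: the names `minibatch_objective` and `dae_reconstruct` are hypothetical, and the encoder `f`, decoder `g` and corruption `corrupt` are passed in as plain callables rather than trained networks.

```python
import numpy as np

def minibatch_objective(X_batch, reconstruct):
    """Sum over a minibatch of squared Frobenius norms ||X_i - X_hat_i||^2."""
    return sum(np.sum((X - reconstruct(X)) ** 2) for X in X_batch)

def dae_reconstruct(X, f, g, corrupt):
    """DAE forward pass: corrupt the input, encode it, then decode it."""
    return g(f(corrupt(X)))
```

Note that the target in the objective is the uncorrupted $X$, so any noise already present in $X$ is part of what the DAE is asked to reproduce.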

Many equivalent parametrisations of the manifold may be possible, some
allowing a more semantic interpretation of each latent coefficient than others, and
the standard AE or DAE does not distinguish among these parametrisations. 
There has been recent interest in adapting the training schemes of autoencoders 
such that the latent representation is explicitly semantic, capturing attributes 
of the input in specific subsets of the latent variables [3, 4]. We refer to these as 
partitioned autoencoders since the latent variables are partitioned into subsets 
which are treated differently from each other during training. Crucially, in this 
prior work the training scheme relies heavily on the existence of large structured 
datasets: in [4] a balanced dataset of labelled digit images; in [3] a dataset of 
faces constructed through systematic variation of attributes such as pose and 
lighting. Without such known structure in the training data, their proposed 
training schemes will be either impossible to apply, or biased by the presence of 
unbalanced or correlated factors in the training data. 

Our present motivation is to learn a denoising representation, trained using 
data which does not contain truly clean examples. If we can develop a scheme 
that learns to represent both signal and noise, but partitioning them into
separate latent coefficients, then we will be able to perform denoising or further
analysis by using only the “foreground” (signal) coefficients and setting the
remainder to zero. The scheme of [3] would be appropriate if the SNR of each
training example were known and systematically varied, but in uncontrolled 
datasets this information is rarely available. Instead we propose a scheme based 
on the observation that many scenarios consist of sparsely-present foreground 
and densely-present background, and so noise-only training data is much easier 
to come by than signal-only. 

Our training scheme is based on a standard autoencoder reconstruction
objective, augmented with a structured regularisation of the latent variables. In
order to encourage the model to use the background latents to represent noise
and never to represent signal, we add a soft regulariser that penalises foreground
latent activation for the noise-only examples. For each training example $X$
associated with a weak label $y$ taking the value 1 if the example is a “noise-only”
example and 0 otherwise, we train an autoencoder by minimising the following
loss function:

$$l(X, y) = \|X - \hat{X}\|^2 + \frac{\lambda y}{\bar{C}} \|C \odot f(X)\|^2$$

where $\hat{X} = g(f(X))$, $\lambda$ is a regularisation coefficient, $\odot$ represents elementwise
multiplication, $C$ is a masking matrix containing 1 for latents which should
represent foreground and 0 otherwise, and $\bar{C}$ is the mean value of $C$. We will
construct our training minibatches with a fixed proportion of noise-only items
at each iteration. The value used for $\lambda$ will be relatively large, to impose a soft
constraint pushing a subset of the latent values to zero in the case of noise-only
items (Figure 1, Figure 2). This encourages the learned representation to use
the non-regularised latents for the background noise, leaving the regularised
latents for the foreground signal. The parameters of $f(\cdot)$
and $g(\cdot)$ will be optimised through SGD. Once trained, denoising is achieved by
reconstructing using only the foreground latents, i.e. setting the others to zero
(Figure 3).
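
As a minimal sketch of the loss for a single example, assuming the latents are held in a NumPy array and the mask is a 0/1 array that broadcasts against it, the computation could look as follows. The function names are hypothetical, and the default of 0.75 for `lam` is simply the value used later in the evaluation.

```python
import numpy as np

def partitioned_loss(X, X_hat, latents, C, y, lam=0.75):
    """Reconstruction error plus the masked latent penalty for one example.

    X, X_hat : arrays of equal shape (input and its reconstruction)
    latents  : the encoder output f(X)
    C        : 0/1 mask, 1 for foreground latents (broadcastable to latents)
    y        : 1 for a "noise-only" example, 0 otherwise
    lam      : regularisation coefficient
    """
    recon = np.sum((X - X_hat) ** 2)        # squared Frobenius norm
    penalty = np.sum((C * latents) ** 2)    # energy in the foreground latents
    return recon + (lam * y / np.mean(C)) * penalty

def foreground_only(latents, C):
    """Latents used for denoising: background latents are set to zero."""
    return C * latents
```

When `y` is 0 the penalty vanishes and the loss reduces to plain reconstruction error, which is the asymmetry the scheme relies on. Passing the output of `foreground_only` to the decoder gives the denoised reconstruction.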

Figure 2 emphasises that the proportion of latents dedicated to foreground
vs. background, and the balance of “signal-plus-noise” and “noise-only”
examples in a minibatch, are independent configuration choices. We tend to reserve
25% of latents for background, and 25% of a minibatch as noise-only; in our
evaluation we will examine the impact of varying the balance of latents.
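
The two configuration choices can be made concrete with a short sketch. The helper names are hypothetical, and the decision to place the background latents at the end of the latent vector and the noise-only items at the start of the minibatch is an arbitrary assumption made for illustration.

```python
import numpy as np

def make_mask(n_latents=32, background_fraction=0.25):
    """0/1 vector C: 1 marks foreground latents, 0 marks background latents."""
    n_background = int(round(n_latents * background_fraction))
    C = np.ones(n_latents)
    C[n_latents - n_background:] = 0.0   # which indices are background is arbitrary
    return C

def minibatch_labels(batch_size=16, noise_only_fraction=0.25):
    """Weak labels y for one minibatch: 1 = noise-only, 0 = noise and possible signal."""
    n_noise_only = int(round(batch_size * noise_only_fraction))
    y = np.zeros(batch_size, dtype=int)
    y[:n_noise_only] = 1
    return y
```

With the defaults shown, 24 of 32 latents are foreground and 4 of 16 minibatch items are noise-only; either fraction can be changed without touching the other.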

Note that the regularisation scheme is asymmetric: it encourages some
latents to zero in certain cases, but it does not prevent them from being zero in
the other cases. In principle the system could simply never use those latents; 
however the reconstruction cost encourages the system to use them to improve 
reconstruction in cases where it has that freedom. The asymmetry also means 
that the important aspect of data labelling is to identify “noise-only” items with 
high precision. If some “noise-only” items are not labelled as such, the system 
is free to represent them without making use of those latents. The scheme can 
thus be used on datasets which are too large to label completely, but in which 
a subset of noise-only examples can be identified.

Figure 1: Our training scheme aims to reconstruct all the minibatch items, while
regularising some of the latents.

3 Convolutional partitioned autoencoder for audio spectrograms

The training scheme can be applied to various autoencoder architectures.
Convolutional neural networks have recently proven to be powerful for many tasks,
while relatively easy to train because they have far fewer free parameters than
an equivalent fully-connected network [2, 5]. In our application we aim to
extract information from audio spectrograms, indexed by time and frequency. We
will thus use an autoencoder which is convolutional in time and fully-connected 
in frequency, as is standard for recent neural network audio analysis [5]. Given 
an input matrix X indexed by discrete time n and frequency h, we define our 
encoding function to be 

$$f(X) = \mathrm{mp}(r(W_c * (X - \mu)/\sigma))$$

where $W_c$ is a tensor of coding weights indexed by time $m$, frequency $h$ and
latent index $k$, $r(\cdot)$ is a rectified linear unit nonlinearity, $\mathrm{mp}(\cdot)$ represents the
max-pooling operation applied along the time axis, $\mu$ and $\sigma$ are the
frequency-wise means and standard deviations estimated from the training set and used
for normalisation, and $*$ indicates one-dimensional convolution as follows:

$$(A * B)_{n,k} = \sum_{m=1}^{M} \sum_{h=1}^{H} \ldots, \qquad n \in [1, N],\ k \in [1, K]$$

resulting in a matrix indexed by time n and latent index k. 
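
A rough NumPy sketch of the encoder is given here under stated assumptions: the time axis is zero-padded so that the output has as many frames as the input, each filter is centred on its output frame and applied without time reversal, and the number of frames is divisible by the pooling factor. None of these alignment details is fixed by the text above, and the function name `encode` is hypothetical.

```python
import numpy as np

def encode(X, W_c, mu, sigma, pool=16):
    """X: (N, H) spectrogram; W_c: (M, H, K) filters; mu, sigma: (H,) statistics."""
    N, H = X.shape
    M, _, K = W_c.shape
    Z = (X - mu) / sigma                               # frequency-wise normalisation
    pad = M // 2
    Zp = np.pad(Z, ((pad, M - 1 - pad), (0, 0)))       # zero-pad the time axis only
    # Each output frame sums filter weights over an M-frame window and all H bins.
    conv = np.stack([np.tensordot(Zp[n:n + M], W_c, axes=([0, 1], [0, 1]))
                     for n in range(N)])
    act = np.maximum(conv, 0.0)                        # rectified linear unit
    # Max-pool along time in non-overlapping blocks of `pool` frames.
    return act.reshape(N // pool, pool, K).max(axis=1)
```

The result is indexed by pooled time frame and latent index, so every latent is a short non-negative time series.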

Our decoding function is 

$$g(f(X)) = W_d * \mathrm{mp}^{-1}(f(X))$$

where $W_d$ is a tensor of decoding weights, and $\mathrm{mp}^{-1}(\cdot)$ is the inverse of the
max-pooling operation (the approximate inverse, as the non-maximal values
are reconstituted by zeros).

Figure 2: Regularisation is applied to the subset of latents intended to represent
signal, and only for noise-only examples.
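
The approximate inverse of max-pooling can be illustrated as follows. This sketch assumes that the position of each maximum is remembered at pooling time and reused when unpooling; the text does not state how the positions are chosen, and the function names are hypothetical.

```python
import numpy as np

def max_pool_time(A, pool=16):
    """A: (N, K) with N divisible by pool. Returns pooled values and argmax offsets."""
    N, K = A.shape
    blocks = A.reshape(N // pool, pool, K)
    return blocks.max(axis=1), blocks.argmax(axis=1)

def unpool_time(pooled, argmax, pool=16):
    """Approximate inverse: maxima return to their positions, zeros elsewhere."""
    T, K = pooled.shape
    blocks = np.zeros((T, pool, K))
    t_idx, k_idx = np.meshgrid(np.arange(T), np.arange(K), indexing="ij")
    blocks[t_idx, argmax, k_idx] = pooled
    return blocks.reshape(T * pool, K)
```

Pooling the unpooled array again returns the same pooled values, which is the sense in which the inverse is exact on the maxima and lossy everywhere else.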

In this temporally convolutional architecture, the mask matrix $C$ is
implemented as identical frame-wise mask matrices, with $k$ indexing the set of latent
time series.

Our data will be non-negative. Data normalisation at the input is important
for effective training, hence the use of $\mu$ and $\sigma$ [6]. However we do not undo the
normalisation at the decoder outputs: the non-negative target and the rectifier
then give the property that the decoder learns a non-negative parts-based
reconstruction. We do not use bias units, as we found the resulting system easier
to train (cf. [7, 8]).

For the evaluation that follows, fixed configuration details are: input
spectrograms have 512 time frames and 32 frequency bins, and we use 32 latent
variables. Convolution filters have a length of 9 time frames. Our max-pooling
downsamples the time axis by a factor of 16. We train the network using
AdaDelta to control the SGD learning rates [9]. We do not use dropout. We
initialise the tensor of filters as a set of $K$ random orthogonal unit vectors of
length $MH$, reshaping this to the tensor of shape $M \times H \times K$ (cf. [10]).
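
One way to obtain such an initialisation, shown here as a sketch rather than as the procedure actually used, is to take the QR decomposition of a random Gaussian matrix and reshape its orthonormal columns. The function name and the `seed` parameter are hypothetical, and the approach requires $MH \geq K$.

```python
import numpy as np

def orthogonal_filters(M=9, H=32, K=32, seed=0):
    """K random orthonormal vectors of length M*H, reshaped to (M, H, K)."""
    rng = np.random.default_rng(seed)
    # Reduced QR of an (M*H, K) Gaussian matrix gives K orthonormal columns.
    Q, _ = np.linalg.qr(rng.standard_normal((M * H, K)))
    return Q.reshape(M, H, K)
```

Flattening the first two axes of the result recovers the matrix of unit vectors, whose Gram matrix is the identity.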

In this study we explore only a single-layer autoencoder. Our approach 
applies straightforwardly to a deep autoencoder, and a deeper system would be 
expected to have a broader ability to generalise. 

4 Evaluation 

We test our approach using a task to denoise birdsong audio spectrograms
(Figure 4). As foreground we use recordings of chiffchaff (similarly to [11, 12]).
For evaluation purposes we wish to add a background that offers a substantial
test: it should be diverse, nonstationary and contain significant energy in the
same frequency band as the birdsong. After considering many available options
we settled on recorded restaurant noise, which contains multi-speaker human
speech as well as diverse percussive sound events. We add this background
noise, both to create a known amount of “intrinsic” noise in the examples, and
to create “noise-only” examples.

Figure 3: Once trained, the system can denoise by reconstructing using only a
subset of the latents (setting others to zero).

Our system is implemented in the Theano framework [13] making use of
GPU processing. To create a diverse training/validation dataset within the
constraints of limited GPU memory, we take advantage of the approximately additive
nature of audio spectrograms as follows. In each experiment we load 30 seconds
of signal, of intrinsic noise and extrinsic noise, and store their spectrograms to 
the GPU. Then to generate each “signal-plus-noise” datum we randomly sample 
1.5 second segments from each of the three sources and mix them. To generate a 
“noise only” datum we sample only from the extrinsic noise source. This creates 
a large generative dataset within limited memory. We also experimented with 
replacing the 30 second source material regularly throughout training but this 
made little difference, so we do not employ that in the present results. 
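
The sampling procedure can be sketched as below. This assumes each source is stored as a (frames, bins) spectrogram array that has already been scaled to its intended level, and that a 1.5 second segment corresponds to 512 frames; the function names are hypothetical.

```python
import numpy as np

def sample_segment(spec, n_frames, rng):
    """Random contiguous excerpt of n_frames frames from a (frames, bins) array."""
    start = rng.integers(0, spec.shape[0] - n_frames + 1)
    return spec[start:start + n_frames]

def make_datum(signal, intrinsic, extrinsic, noise_only, rng, n_frames=512):
    """One training example, mixed additively in the spectrogram domain."""
    x = sample_segment(extrinsic, n_frames, rng)
    if not noise_only:
        # Independent random offsets for each source give many distinct mixtures.
        x = x + sample_segment(signal, n_frames, rng) \
              + sample_segment(intrinsic, n_frames, rng)
    return x
```

Because the three offsets are drawn independently, a small amount of stored source material yields a very large number of distinct training examples.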

Our audio sources use sample rate 22.05 kHz, analysed using 128 bin FFTs
with 50% hop, and the frequency axis then reduced to 32 bins of interest for the
birdsong (1.7-7.2 kHz). Within this frequency band, we added intrinsic noise
to give an SNR of -10 dB, then extrinsic noise to give an SNR of -30 dB. The
extrinsic noise corresponds to the “noise-only” items that would be provided in
practice. The intrinsic noise is added for evaluation only, to judge whether the
system can remove it (Figure 5). We study two cases, in which the intrinsic and
extrinsic noise sources are either matched or unmatched.
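
The figures in this paragraph can be checked with a little arithmetic, and the mixing levels expressed as a gain calculation. In the sketch below the SNR is defined as a ratio of summed squared values, which is an assumption, as is the reference against which each noise level is set; the function name is hypothetical.

```python
import numpy as np

SAMPLE_RATE = 22050                    # Hz
FFT_SIZE = 128                         # samples per analysis frame
HOP = FFT_SIZE // 2                    # 50% hop
BIN_HZ = SAMPLE_RATE / FFT_SIZE        # spacing between FFT bins, in Hz
BAND_HZ = 32 * BIN_HZ                  # width spanned by 32 bins, in Hz
SEGMENT_S = 512 * HOP / SAMPLE_RATE    # duration of 512 frames, in seconds

def scale_noise_to_snr(signal, noise, snr_db):
    """Scale `noise` so that 10*log10(sum(signal^2) / sum(noise^2)) equals snr_db."""
    p_signal = np.sum(signal ** 2)
    p_noise = np.sum(noise ** 2)
    gain = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return gain * noise
```

The 32 bins span about 5.5 kHz, in line with the stated 1.7-7.2 kHz band, and 512 frames last just under 1.5 seconds.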

We test our partitioned AE at various settings for the proportion of latents
regularised, and also a standard DAE using the same training data, configured
to use the extrinsic noise as the additive corruption $u(\cdot)$ in the DAE training
process. In all cases we fix $\lambda = 0.75$ and train with minibatches of size 16, and
$10^6$ iterations of SGD.

Figure 4: Example of source separation with the proposed system. Upper: input
spectrogram, 1.5 seconds (birdsong, plus background noise including a notable
loud noise near the end). Middle: reconstructed signal spectrogram. Lower:
reconstructed noise spectrogram.

Results evaluated on separately-sampled validation data (Figure 6) show a 
number of interesting properties. Firstly, all methods perform best against their 
explicit objective in every case: i.e. their best SNR is the one relating to their 
objective function (simple reconstruction for the proposed system,
reconstruction of the partly-clean input for the DAE). However, the statistic of interest
is reconstruction of the truly-clean spectrogram. In matched conditions (upper
plot), the proposed system strongly outperforms the DAE on this measure. Its
strong performance is stable across a broad range of settings for the
proportion of latents regularised (all except the extremes). The poor results at the
setting with regularisation of all latents (100%) confirm that the regularisation 
used is strong: if applied to all latents, it causes underfitting. It is not merely 
the presence of regularisation that improves the results, but the partitioning 
scheme. 
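
The three reconstruction targets discussed above can be compared with a small helper. This is a sketch only: the SNR definition (ratio of summed squared values, in dB) and the names `snr_db` and `evaluate` are assumptions, not details taken from the experiments.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB of `estimate`, treating `reference` as the target."""
    err = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10(np.sum(reference ** 2) / err)

def evaluate(estimate, clean, intrinsic, extrinsic):
    """SNR of one reconstruction against three possible targets."""
    return {
        "full input": snr_db(clean + intrinsic + extrinsic, estimate),
        "partly clean": snr_db(clean + intrinsic, estimate),
        "truly clean": snr_db(clean, estimate),
    }
```

Only the last entry uses material the system never sees in training, which is why it is the statistic of interest.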

In unmatched conditions (lower plot) the strong performance is not
sustained. The DAE continues to perform well at its partial-denoising task, but
our proposed partitioned AE fails to generalise across diverse background recordings. It
seems likely that this is due to the small size and depth of the current setup:
a single-layer autoencoder with only 32 convolutional filters has limited ability
to approximate arbitrary functions. The strong performance in matched
conditions suggests further study of the method with deeper networks. However,
note that matched conditions are common in applications such as ours, where
the noise-only examples can be taken from the main field recording sessions.

Figure 5: For evaluation only (not for application), we construct our
“signal+noise” examples from clean signal plus added intrinsic noise, so that we
can evaluate SNR against the truly clean signal which the system never
accesses.

5 Conclusions 

We have introduced a partitioned autoencoder that can learn to denoise data 
with better fidelity than a standard DAE, in the common practical case where 
no noise-free examples are available for training. In matched conditions this 
partitioned scheme makes better use of the available data than a DAE. Unlike a 
DAE, our partitioned autoencoder learns to represent the signal and the noise 
content, and can reconstruct either or both. This may be part of its advantage 
over a standard DAE, which does not learn to represent the noise content. 
Further work will explore deeper/wider architectures to improve the generality, 
and use the representations to support classification and other tasks.

Figure 6: Reconstruction SNRs measured on the validation data, for the
partitioned AE or DAE. Intrinsic and extrinsic noise may be matched (upper plot)
or unmatched (lower plot). Experiments are performed separately with three
different noise source recordings, and the error bars show the range across three
experiments. The large error bars are due to individual differences and not
random variation (inspected by eye).


References 


[1] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol,
“Stacked denoising autoencoders: Learning useful representations in a deep
network with a local denoising criterion,” The Journal of Machine Learning
Research, vol. 11, pp. 3371-3408, 2010.

[2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
no. 7553, pp. 436-444, 2015.

[3] T. D. Kulkarni, W. Whitney, P. Kohli, and J. B. Tenenbaum, “Deep 
convolutional inverse graphics network,” arXiv preprint arXiv:1503.03167, 
2015. 

[4] B. Cheung, J. A. Livezey, A. K. Bansal, and B. A. Olshausen,
“Discovering hidden factors of variation in deep networks,” arXiv preprint
arXiv:1412.6583, 2014.

[5] S. Dieleman and B. Schrauwen, “End-to-end learning for music audio,” in
Proc. ICASSP, 2014, pp. 6964-6968.

[6] G. Montavon, G. Orr, and K.-R. Müller, Eds., Neural Networks: Tricks of
the Trade, Springer, 2012.

[7] T. L. Paine, P. Khorrami, W. Han, and T. S. Huang, “An analysis of
unsupervised pre-training in light of recent advances,” arXiv preprint
arXiv:1412.6597, 2014.

[8] R. Memisevic, K. Konda, and D. Krueger, “Zero-bias autoencoders and the 
benefits of co-adapting features,” arXiv preprint arXiv:1402.3337, 2014. 

[9] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint 
arXiv:1212.5701, 2012. 

[10] A. M. Saxe, J. L. McClelland, and S. Ganguli, “Exact solutions to the 
nonlinear dynamics of learning in deep linear neural networks,” arXiv 
preprint arXiv:1312.6120, 2013. 

[11] D. Stowell and M. D. Plumbley, “Segregating event streams and noise with 
a Markov renewal process model,” Journal of Machine Learning Research, 
vol. 14, pp. 1891-1916, 2013, preprint arXiv:1211.2972. 

[12] D. Stowell, S. Musevic, J. Bonada, and M. D. Plumbley, “Improved
multiple birdsong tracking with distribution derivative method and Markov
renewal process clustering,” in Proceedings of the International Conference
on Acoustics, Speech and Signal Processing (ICASSP), 2013, preprint
arXiv:1302.3642.

[13] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow,
A. Bergeron, N. Bouchard, and Y. Bengio, “Theano: new features and speed
improvements,” in Proceedings of NIPS 2012, 2012.

